Red Wine by Ute Stohner

library(ggthemes)

Univariate Plots

Getting overview first

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##                    X        fixed.acidity     volatile.acidity 
##                FALSE                FALSE                FALSE 
##          citric.acid       residual.sugar            chlorides 
##                FALSE                FALSE                FALSE 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                FALSE                FALSE                FALSE 
##                   pH            sulphates              alcohol 
##                FALSE                FALSE                FALSE 
##              quality 
##                FALSE
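A minimal sketch of how an overview like this could be produced. The data frame name `RedWine` is taken from the model output later in this report, and the six rows are copied from the `head()` output (only a few columns are kept). The use of `sapply(..., anyNA)` for the missing-value check is an assumption, since the original call is not shown.

```r
# First six rows of the data set, copied from the head() output in this report.
# Only five of the thirteen columns are kept to keep the sketch short.
RedWine <- data.frame(
  X                = 1:6,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Structure: number of observations, column names and column types
str(RedWine)

# One TRUE/FALSE per column: does the column contain any missing value?
# (assumed approach; the original call is not shown in the report)
has_na <- sapply(RedWine, anyNA)
print(has_na)
```

A column reporting `FALSE` has no missing values, so no rows need to be dropped or imputed before plotting.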

Creating summaries for other variables

## Variable                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## X                            1.0     400.5     800.0     800.0    1199.5    1599.0
## fixed.acidity               4.60      7.10      7.90      8.32      9.20     15.90
## volatile.acidity          0.1200    0.3900    0.5200    0.5278    0.6400    1.5800
## citric.acid                0.000     0.090     0.260     0.271     0.420     1.000
## residual.sugar             0.900     1.900     2.200     2.539     2.600    15.500
## chlorides                0.01200   0.07000   0.07900   0.08747   0.09000   0.61100
## free.sulfur.dioxide         1.00      7.00     14.00     15.87     21.00     72.00
## total.sulfur.dioxide        6.00     22.00     38.00     46.47     62.00    289.00
## density                   0.9901    0.9956    0.9968    0.9967    0.9978    1.0037
## pH                         2.740     3.210     3.310     3.311     3.400     4.010
## sulphates                 0.3300    0.5500    0.6200    0.6581    0.7300    2.0000
## alcohol                     8.40      9.50     10.20     10.42     11.10     14.90
## quality                    3.000     5.000     6.000     5.636     6.000     8.000

Getting familiar with the file.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

I am assuming that quality is the dependent variable that we are trying to influence. The goal is to understand what influences wine quality. Not knowing much about wine making, I could imagine that this helps wine producers understand where to grow wine, which grapes to grow and how to blend wine. Quality is a qualitative (ordinal) variable. All other variables are quantitative.

Examining quality

What is the scale for quality?

## [1] 5 6 7 4 8 3

Finding out where most wines fall

Most wines have a quality of 5 or 6.
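A small sketch of how the quality scale and the counts per level could be checked. The twelve quality values are only the sample rows printed elsewhere in this report (the six `head()` rows and rows 206, 207, 231, 242, 315 and 316), not the full data, so the counts illustrate the calls rather than reproduce the real distribution.

```r
# Quality scores of the twelve sample rows printed in this report
quality <- c(5, 5, 5, 6, 5, 5, 7, 7, 7, 6, 5, 6)

# Which quality levels occur?
quality_levels <- sort(unique(quality))
print(quality_levels)

# How many wines fall into each level, and what share is that?
quality_counts <- table(quality)
quality_share  <- prop.table(quality_counts)
print(quality_counts)
print(round(quality_share, 2))
```

On the full data set the same two calls would show how strongly levels 5 and 6 dominate.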

Alcohol varies from 8.4 to 14.9, with a median of 10.2.

Does sugar influence alcohol content?

Does the graph look better with rounded alcohol?

There is a spike just above 8, probably caused by a high-sugar outlier. How does it look with rounded numbers?

Alcohol does not rise or fall with higher sugar levels. Creating alcohol_level to summarise alcohol for further plots.
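A sketch of how `alcohol_round` and `alcohol_level` could be derived. Both column names and the label "Medium (10 to 12)" appear in a table later in this report. The other two labels and the exact break points are assumptions chosen to be consistent with that table, where an alcohol value of 12.2 is still classed as medium.

```r
# Alcohol values that occur in this report (14.9 is the maximum)
alcohol <- c(9.4, 9.8, 10.8, 12.2, 14.9)

# Round to whole numbers first
alcohol_round <- round(alcohol)

# Bin the rounded values. Breaks and the "Low"/"High" labels are assumptions;
# only "Medium (10 to 12)" is taken from the report.
alcohol_level <- cut(
  alcohol_round,
  breaks = c(-Inf, 9, 12, Inf),   # (-Inf, 9], (9, 12], (12, Inf)
  labels = c("Low (under 10)", "Medium (10 to 12)", "High (over 12)")
)

print(data.frame(alcohol, alcohol_round, alcohol_level))
```

Binning on the rounded value rather than the raw value is what makes 12.2 land in the medium group.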

Volatile acidity and Quality

With higher quality, volatile acidity goes down.

Fixed.acidity and volatile.acidity

According to this article: http://winemakersacademy.com/understanding-wine-acidity/ most wine drinkers perceive the total amount of acidity rather than volatile and fixed acidity separately. Adding a new column for total acidity.
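The new column can be sketched as the sum of fixed and volatile acidity. The five rows are copied from a table later in this report, where the `total.acidity` values match this sum. The data frame name `wines` is only a placeholder for this example.

```r
# Five rows copied from the sample output later in this report
wines <- data.frame(
  fixed.acidity    = c(12.8, 5.2, 12.0, 7.4, 7.1),
  volatile.acidity = c(0.30, 0.48, 0.38, 0.36, 0.35)
)

# Total acidity as the sum of the two acidity columns
wines$total.acidity <- wines$fixed.acidity + wines$volatile.acidity

print(wines)
```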

How does acidity influence quality?

Removing outliers in high acidity

There is no clear picture for total acidity and quality.

Same plot with volatile acidity plus line for the mean

Lower volatile.acidity goes along with higher quality.

Adding in quality_factor, to have the data as a factor
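A sketch of the conversion. The name `quality_factor` is from this report and the levels 3 to 8 are the quality values listed above. Making the factor ordered is an assumption, chosen because quality is an ordinal scale. The twelve values are the sample rows printed in this report, not the full data.

```r
# Quality scores of the twelve sample rows printed in this report
quality <- c(5, 5, 5, 6, 5, 5, 7, 7, 7, 6, 5, 6)

# Ordered factor with all six levels of the scale, even those
# that do not occur in this small sample (assumption: ordered = TRUE)
quality_factor <- factor(quality, levels = 3:8, ordered = TRUE)

print(levels(quality_factor))
print(table(quality_factor))
```

Listing the levels explicitly keeps empty levels in tables and plot legends, which makes plots from different subsets comparable.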

Frequency polygon for acidity colored by quality factor

This graph is not very helpful, mainly because the counts per quality level are so different. Most wines fall into quality 5 and 6. Trying a log scale.

Frequency polygon for acidity colored by quality, log scale

With the log scale, the problem of the small size of high and low quality still persists.

pH and acidity are connected (http://chemistry.elmhurst.edu/vchembook/184ph.html). A solution is acidic when its pH is below 7. In the RedWine data pH ranges from 2.7 to 4.0. Selecting total.acidity and pH.
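To get a feeling for what this pH range means, here is a small worked example. It relies on the standard definition of pH as the negative base-10 logarithm of the hydrogen ion concentration (strictly, activity), which is a simplification for wine. The helper name `h_concentration` is made up for this sketch, and the two pH values are the minimum and maximum from the summary above.

```r
# Minimum and maximum pH from the summary of the data
ph_min <- 2.74
ph_max <- 4.01

# Hydrogen ion concentration implied by a pH value (simplified definition)
h_concentration <- function(pH) 10^(-pH)

# How many times more hydrogen ions does the most acidic wine have
# compared with the least acidic wine?
ratio <- h_concentration(ph_min) / h_concentration(ph_max)
print(ratio)
```

Because the scale is logarithmic, a difference of about 1.3 pH units corresponds to a ratio between 18 and 19, so the seemingly narrow pH range covers quite different wines.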

Checking volatile and fixed acidity separately. Starting with volatile acidity.

Not the same picture as with total acidity. pH is going up with volatile acidity.

Fixed acidity

pH is going down with fixed acidity. It is not a good idea to combine acidity, as fixed and volatile acidity seem to be different.

Quality and pH

The graph does not show a strong correlation between pH and quality.

Density and Alcohol

Lower density means higher quality, but it does not look like a strong influence. Again we have the problem of the uneven distribution of wines, with most of them in the medium (5/6) quality section.

Distribution of Sulphates

Most wines have sulphates of around 0.6. The counts fall rapidly above that value, but there are some outliers with high sulphates.

Distribution of citric acid

Half of the citric acid values fall between 0.090 and 0.420 (the quartile spread), and three quarters fall between 0 and 0.420. The distribution is not even and shows outliers.

What is the structure of your dataset?

1599 obs. of 13 variables. Quality is ordinal with values from 3 to 8, 8 being the best quality.

Most wines fall into quality levels 5 and 6. Median alcohol is 10.2.

What is/are the main feature(s) of interest in your dataset?

The main feature is quality. So far I cannot tell what determines wine quality. My assumption is that it is a combination of variables.

What other features in the dataset do you think will help support your

It is too early to say. At this point I would concentrate on a combination of alcohol, acidity, pH and density. This needs more investigation.

Did you create any new variables from existing variables in the dataset?

Total acidity was created. alcohol_round and quality_factor were created.

Of the features you investigated, were there any unusual distributions?

Looking at the number of wines, the majority fall into quality 5 and 6. I would have assumed that volatile and fixed acidity can be combined. But comparing them with pH, they seem to have a different influence on the wine.

Bivariate plots

Running ggpairs

Correlation for quality

Focusing first on the correlation for quality. Only two variables show even a weak correlation with quality: volatile acidity (-0.391) and alcohol (0.476).

## [1] -0.3905578
## [1] 0.4761663
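A sketch of how such coefficients can be computed with `cor()`, which uses the Pearson correlation by default. The twelve rows are the sample rows printed in this report, so the resulting numbers will not match the full-data values above. The data frame name `sample_wines` is a placeholder.

```r
# Twelve sample rows printed in this report (not the full data set)
sample_wines <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66,
                       0.30, 0.30, 0.48, 0.38, 0.36, 0.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4,
                       10.8, 10.8, 12.2, 10.9, 11.0, 11.0),
  quality          = c(5, 5, 5, 6, 5, 5, 7, 7, 7, 6, 5, 6)
)

# Pearson correlation of each variable with quality
cor_acidity <- cor(sample_wines$quality, sample_wines$volatile.acidity)
cor_alcohol <- cor(sample_wines$quality, sample_wines$alcohol)

print(c(volatile.acidity = cor_acidity, alcohol = cor_alcohol))
```

Even in this tiny sample the signs agree with the full data: negative for volatile acidity and positive for alcohol.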

Volatile acidity and quality

After exploring single variables in the last section, I would like to understand if there is a combination of variables that influences wine quality. The correlation coefficient suggests there is a weak influence of both volatile acidity and alcohol.

Boxplot for volatile.acidity and quality

The box plot shows that volatile acidity falls with higher quality. The spread of the data gets smaller as the quality increases. We find the smallest spread for quality 8 and the largest for quality 4. It also shows that 5 and 6 have more outliers. It will be a good idea to remove outliers for volatile acidity in the next graphs.
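The numbers behind a box plot can be checked directly by computing the median and the interquartile range per quality level. This sketch uses only the twelve sample rows printed in this report, so it covers quality levels 5 to 7 and does not reproduce the full-data box plot.

```r
# Twelve sample rows printed in this report (not the full data set)
volatile_acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66,
                      0.30, 0.30, 0.48, 0.38, 0.36, 0.35)
quality          <- c(5, 5, 5, 6, 5, 5, 7, 7, 7, 6, 5, 6)

# Median (the line inside each box) and IQR (the height of each box)
median_by_quality <- tapply(volatile_acidity, quality, median)
iqr_by_quality    <- tapply(volatile_acidity, quality, IQR)

print(median_by_quality)
print(iqr_by_quality)
```

In this sample the median volatile acidity drops from quality 5 to 6 to 7, which is the same direction the box plot shows.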

Combining alcohol, quality and volatile acidity in one plot.

While this plot shows that the acidity falls in higher quality while alcohol rises, it does not show which combination of alcohol and acidity is ideal.

The correlation coefficient of alcohol and volatile acidity does not suggest a connection. Let’s verify.

The correlation coefficient did not suggest a correlation between these two variables, and neither does this plot. The median volatile acidity goes down and then up with rising alcohol. The spread of the volatile acidity data also does not differ much between alcohol levels.

Alcohol and Density

A correlation of -0.496 was calculated for alcohol and density.

Density goes down with higher alcohol level.

Residual sugar and density

Density goes up with sugar. Most of the samples have a residual sugar between 1.8 and 2.5 (approx.).

Fixed acidity and density

Density goes up with fixed acidity.

Citric acid and sulphates

Looking into citric.acid (0.226) and sulphates (0.251). These two variables fall just short of a weak correlation with quality. I want to look at them in the linear model in the next section.

This plot shows the same weak correlation as the correlation coefficient.

Talk about some of the relationships you observed in this part of the

Alcohol and volatile acidity influence quality. Density and alcohol, density and fixed acidity, and density and residual sugar are connected. For these last three, I don’t know if one is dependent and one is independent, or if they just appear naturally together.

Did you observe any interesting relationships between the other features

Alcohol and density are related. Density is related to fixed acidity and residual sugar. Apart from alcohol and volatile acidity I could not find other variables that influence quality directly. This surprises me; I would have imagined that at least some influence from density could be seen.

What was the strongest relationship you found?

Fixed acidity and density have the highest correlation coefficient. Regarding quality, the strongest relation is still with alcohol.

Multivariate Plots Section

The correlations suggest that alcohol and volatile acidity influence quality.

Getting the summaries of volatile acidity and alcohol in the two highest quality levels and in the medium (5 and 6) levels

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5386  0.6400  1.3300
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4055  0.4900  0.9150
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00

Overall numbers to compare

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Is there a “sweet spot” for alcohol and volatile acidity?

In the graph below, the darkest dots are the wines of high quality. The top 1% of values has been removed to exclude outliers.

Again, this plot shows that low(er) acidity and high(er) alcohol are a combination for a high quality wine. But we can also see that there are points on the scatterplot of medium quality and even low quality wines that are very close to the high quality wines.
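A sketch of one way to drop the top 1% with `quantile()`. It is an assumption that the cut was applied to volatile acidity and done this way. The values are made up for illustration: only the minimum 0.12 and the maximum 1.58 are taken from the summary in this report.

```r
# Made-up illustration data: 99 evenly spaced values plus one extreme value.
# Only the minimum (0.12) and maximum (1.58) come from the report.
volatile_acidity <- c(seq(0.12, 1.10, length.out = 99), 1.58)

# Value below which 99% of the observations lie
cutoff <- quantile(volatile_acidity, 0.99)

# Keep only the observations below the cutoff
trimmed <- volatile_acidity[volatile_acidity < cutoff]

print(cutoff)
print(length(trimmed))
```

The same cutoff could also be used as an upper axis limit in a plot instead of filtering the data beforehand.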

The same plot, but low quality wines highlighted with darker dots

High acidity seems to be a particular sign of low quality wines. Still, there are some medium quality (and even high quality) wines located very close to the low quality wines.

Linear model

A simple linear model (y = a_1 + a_2*x) will be fitted for quality against each of alcohol, volatile acidity, density, pH, citric acid and sulphates. My main interest is to find the relation between quality, alcohol and volatile acidity.

## 
## Calls:
## mod1: lm(formula = (quality ~ alcohol), data = RedWine)
## mod2: lm(formula = (quality ~ volatile.acidity), data = RedWine)
## mod3: lm(formula = (quality ~ density), data = RedWine)
## mod4: lm(formula = (quality ~ pH), data = RedWine)
## mod5: lm(formula = (quality ~ citric.acid), data = RedWine)
## mod6: lm(formula = (quality ~ sulphates), data = RedWine)
## 
## ========================================================================================================
##                         mod1          mod2          mod3          mod4          mod5          mod6      
## --------------------------------------------------------------------------------------------------------
##   (Intercept)           1.875***      6.566***     80.239***      6.636***      5.382***      4.848***  
##                        (0.175)       (0.058)      (10.508)       (0.433)       (0.034)       (0.078)    
##   alcohol               0.361***                                                                        
##                        (0.017)                                                                          
##   volatile.acidity                   -1.761***                                                          
##                                      (0.104)                                                            
##   density                                         -74.846***                                            
##                                                   (10.542)                                              
##   pH                                                             -0.302*                                
##                                                                  (0.131)                                
##   citric.acid                                                                   0.938***                
##                                                                                (0.101)                  
##   sulphates                                                                                   1.198***  
##                                                                                              (0.115)    
## --------------------------------------------------------------------------------------------------------
##   R-squared             0.227         0.153         0.031         0.003         0.051         0.063     
##   adj. R-squared        0.226         0.152         0.030         0.003         0.051         0.063     
##   sigma                 0.710         0.744         0.795         0.806         0.787         0.782     
##   F                   468.267       287.444        50.405         5.340        86.258       107.740     
##   p                     0.000         0.000         0.000         0.021         0.000         0.000     
##   Log-likelihood    -1721.057     -1794.312     -1901.790     -1923.965     -1884.577     -1874.438     
##   Deviance            805.870       883.198      1010.278      1038.692       988.760       976.300     
##   AIC                3448.114      3594.624      3809.580      3853.930      3775.155      3754.876     
##   BIC                3464.245      3610.756      3825.712      3870.062      3791.286      3771.008     
##   N                  1599          1599          1599          1599          1599          1599         
## ========================================================================================================

For every increase of one unit in alcohol we can expect the quality to go up by 0.361 (standard error 0.017).

For every increase of one unit in volatile acidity we can expect the quality to go down by 1.761 (standard error 0.104). About 23% of the variation in quality can be explained by alcohol alone and about 15% by volatile acidity alone.
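To make the reading of the table concrete, here is a small worked example. In each column the first number is the estimated coefficient and the number in brackets below it is its standard error. The intercepts and slopes are copied from mod1 and mod2. The function names are made up for this sketch.

```r
# Coefficients copied from the model table above
# mod1: quality = 1.875 + 0.361 * alcohol
# mod2: quality = 6.566 - 1.761 * volatile.acidity
predict_from_alcohol <- function(alcohol) 1.875 + 0.361 * alcohol
predict_from_volatile_acidity <- function(volatile_acidity) {
  6.566 - 1.761 * volatile_acidity
}

# One more unit of alcohol raises the predicted quality by the slope, 0.361
alcohol_predictions <- predict_from_alcohol(c(10, 11))
print(alcohol_predictions)
print(diff(alcohol_predictions))

# 0.1 more volatile acidity lowers the predicted quality by 0.1761
acidity_predictions <- predict_from_volatile_acidity(c(0.4, 0.5))
print(acidity_predictions)
print(diff(acidity_predictions))
```

Each model uses a single predictor, so the two effects are estimated separately and cannot simply be added up.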

Comparing alcohol, volatile acidity, density, sulphates and citric acid.

Boxplots to show values for high quality for: alcohol, volatile acidity, density, citric acid and sulphates

Is this the ideal wine?

Looking closer at the results for alcohol and volatile acidity.

Calculating quartile spread for alcohol and volatile acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4055  0.4900  0.9150

The ideal combination?

The following graph shows all red wines whose alcohol and volatile acidity levels are within the quartile range of high quality wine. The blue dots show the wines NOT of high quality. We can see that many lower quality wines share the acidity and alcohol characteristics of higher quality wine.

##       X fixed.acidity volatile.acidity citric.acid residual.sugar
## 206 206          12.8             0.30        0.74            2.6
## 207 207          12.8             0.30        0.74            2.6
## 231 231           5.2             0.48        0.04            1.6
## 242 242          12.0             0.38        0.56            2.1
## 315 315           7.4             0.36        0.29            2.6
## 316 316           7.1             0.35        0.29            2.5
##     chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 206     0.095                   9                   28 0.99940 3.20
## 207     0.095                   9                   28 0.99940 3.20
## 231     0.054                  19                  106 0.99270 3.54
## 242     0.093                   6                   24 0.99925 3.14
## 315     0.087                  26                   72 0.99645 3.39
## 316     0.096                  20                   53 0.99620 3.42
##     sulphates alcohol quality alcohol_round     alcohol_level
## 206      0.77    10.8       7            11 Medium (10 to 12)
## 207      0.77    10.8       7            11 Medium (10 to 12)
## 231      0.62    12.2       7            12 Medium (10 to 12)
## 242      0.71    10.9       6            11 Medium (10 to 12)
## 315      0.68    11.0       5            11 Medium (10 to 12)
## 316      0.65    11.0       6            11 Medium (10 to 12)
##     total.acidity quality_factor alcohol_round_factor
## 206         13.10              7                   11
## 207         13.10              7                   11
## 231          5.68              7                   12
## 242         12.38              6                   11
## 315          7.76              5                   11
## 316          7.45              6                   11
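A sketch of the selection behind this table. The range limits are the first and third quartiles of the high quality wines shown above (alcohol 10.8 to 12.2, volatile acidity 0.30 to 0.49). The twelve rows are the sample rows printed in this report, and the names `sample_wines` and `in_range` are placeholders.

```r
# Twelve sample rows printed in this report (not the full data set)
sample_wines <- data.frame(
  X                = c(1:6, 206, 207, 231, 242, 315, 316),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66,
                       0.30, 0.30, 0.48, 0.38, 0.36, 0.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4,
                       10.8, 10.8, 12.2, 10.9, 11.0, 11.0),
  quality          = c(5, 5, 5, 6, 5, 5, 7, 7, 7, 6, 5, 6)
)

# Keep wines inside the quartile range of the high quality wines
in_range <- subset(
  sample_wines,
  alcohol >= 10.8 & alcohol <= 12.2 &
    volatile.acidity >= 0.30 & volatile.acidity <= 0.49
)

# How many of the selected wines are not of high quality (below 7)?
n_not_high <- sum(in_range$quality < 7)

print(in_range)
print(n_not_high)
```

In this sample half of the wines inside the range are not of high quality, which illustrates why the range alone does not identify a good wine.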

Bar chart to visualize the number of wines falling into the “ideal” acidity and alcohol level

Talk about some of the relationships you observed in this part of the

High alcohol and low volatile acidity help create a high quality wine.

It seems safe to say that high acidity and low alcohol is a combination for a low quality wine. The opposite is not that clear. There are low and medium quality wines that show a combination of high alcohol and low volatile acidity. I am assuming that quality is measured by human testers and not by chemical components. There might be a component in high quality that is not captured in our data.

Were there any interesting or surprising interactions between features?

I was surprised by the relationship of pH, fixed acidity and volatile acidity. I would have thought that both kinds of acidity influence pH in the same way.

OPTIONAL: Did you create any models with your dataset? Discuss the

I created a linear model.

The linear model did support the result of the correlation coefficient.


Final Plots and Summary

Plot One

Description One

The plot shows the distribution of our samples in the quality levels. Most of our samples fall into “medium” quality, 5 or 6. The sample is rather small, so this is interesting. Medium quality wine might not be something wine makers try to avoid. I am assuming more consumers purchase middle quality wines. So the goal might be to avoid low quality wines.

Plot Two

Description Two

Lower volatile acidity has a positive effect on quality. In this plot we can also see that higher alcohol (green and blue points) increases the quality. Low volatile acidity and high alcohol make a high quality wine.

Plot Three

Description Three

Looking only at wines of higher quality (7 and 8). Their quartile ranges are 0.3 to 0.49 for volatile acidity and 10.8 to 12.2 for alcohol. Are these the signs of high quality wines? Maybe not: the graph above highlights in blue all low and medium quality wines with the same characteristics.

Reflection

Even after investigating the data set, I feel something is missing. I would have expected to have a clearer picture of wine quality. Maybe this is caused by quality involving a human factor? While it is clear that certain characteristics make a bad wine, I am not sure I can confidently say what makes a good wine.

It would have helped to include more wine samples, possibly from more regions. Price would be a helpful variable, as I am assuming that in the end this is about helping winemakers make a profit.

I mostly struggled with not knowing enough about wine. I did research about some of the variables. The assignment would have been easier if I had had domain expertise.

This was a good exercise in R, as I had to go back to my lesson notes and use what I learned. I found it challenging to create plots that summarise the data. Since starting work on this project I have been paying closer attention to plots I see in books, newspapers or online articles to get a better understanding of what kind of plot works for which data. I do feel I need to learn more about using and building a model. I was mostly relying on what I had learned in Part 1 for linear regression.